
    On the Optimality of Averaging in Distributed Statistical Learning

    A common approach to statistical learning with big data is to randomly split the data among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters p is fixed and the number of samples per machine n → ∞; second, a high-dimensional regime where both p, n → ∞ with p/n → κ ∈ (0, 1). For both regimes, under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second-order error terms and show that these can be non-negligible, notably for non-linear models. The high-dimensional setting, in contrast, exhibits qualitatively different behavior: data splitting incurs a first-order accuracy loss that, to leading order, increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice of the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.
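
The split-and-average strategy the abstract describes can be illustrated with a toy ordinary-least-squares example (a sketch only, not the paper's experiments; all sizes m, n, p below are made up). In this fixed-p, large-n regime, the averaged estimator's error is comparable to the centralized one, as the paper proves to leading order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: linear model y = X @ beta + noise, p fixed,
# n samples on each of m machines.
p, m, n = 5, 10, 2000
beta = rng.normal(size=p)
X = rng.normal(size=(m * n, p))
y = X @ beta + rng.normal(size=m * n)

# Centralized OLS estimate using all m*n samples at once.
beta_central, *_ = np.linalg.lstsq(X, y, rcond=None)

# Split-and-average: fit OLS on each machine's shard, then average the
# m individual estimates.
shards = [(X[i * n:(i + 1) * n], y[i * n:(i + 1) * n]) for i in range(m)]
beta_avg = np.mean(
    [np.linalg.lstsq(Xs, ys, rcond=None)[0] for Xs, ys in shards], axis=0
)

# In the fixed-p regime both errors are small and of comparable size.
err_central = np.linalg.norm(beta_central - beta)
err_avg = np.linalg.norm(beta_avg - beta)
print(err_central, err_avg)
```

For a linear model the two estimators differ little; the paper's point is that for non-linear models the second-order terms of the averaged estimator can become non-negligible.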

    Revisiting Multi-Subject Random Effects in fMRI: Advocating Prevalence Estimation

    Random-effects analysis has been introduced into fMRI research in order to generalize findings from the study group to the whole population. Generalizing findings is obviously harder than detecting activation in the study group, since to be significant an activation has to be larger than the inter-subject variability. Indeed, detected regions are smaller when using random-effects analysis rather than fixed effects. The statistical assumptions behind the classic random-effects model are that the effect at each location is normally distributed over subjects, and that "activation" refers to a non-null mean effect. We argue this model is unrealistic compared to the true population variability, where, due to functional plasticity and registration anomalies, at each brain location some subjects are active and some are not. We propose a finite-Gaussian-mixture random-effects model that accommodates between-subject spatial disagreement and quantifies it using the "prevalence" of activation at each location. This measure has several desirable properties: (a) it is more informative than the typical active/inactive paradigm; (b) in contrast to the hypothesis-testing approach (and thus t-maps), in which the null is trivially rejected for large sample sizes, the prevalence statistic becomes more informative as the sample size grows. In this work we present a formal definition and an estimation procedure for this prevalence. The end result of the proposed analysis is a map of the prevalence at locations with significant activation, highlighting activation regions that are common across many brains.
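
The prevalence idea at a single location can be sketched as a two-component Gaussian mixture fit by EM: a fraction of subjects are "active" (non-null effect) and the rest are not, and the mixture weight is the prevalence. This is a minimal illustration under assumed unit variances and a known null mean of zero, not the paper's actual estimation procedure; `prev_true` and `mu_true` are invented values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-location data: a fraction `prev_true` of subjects
# are active (effect ~ N(mu_true, 1)); the rest are inactive (~ N(0, 1)).
prev_true, mu_true = 0.4, 3.0
n_subjects = 5000
active = rng.random(n_subjects) < prev_true
effects = np.where(active,
                   rng.normal(mu_true, 1.0, n_subjects),
                   rng.normal(0.0, 1.0, n_subjects))

def norm_pdf(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

# EM for the mixture weight ("prevalence") and the active-component mean,
# keeping the null component fixed at N(0, 1).
prev, mu = 0.5, 1.0
for _ in range(200):
    # E-step: posterior probability that each subject is active.
    w = prev * norm_pdf(effects, mu)
    resp = w / (w + (1 - prev) * norm_pdf(effects, 0.0))
    # M-step: update the prevalence and the active-component mean.
    prev = resp.mean()
    mu = (resp * effects).sum() / resp.sum()

print(round(prev, 2), round(mu, 2))
```

Unlike a t-test on the mean effect, which rejects any non-null mean given enough subjects, the estimated prevalence converges to a population quantity that stays interpretable as the sample grows.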

    Theoretical Foundations and Empirical Evaluations of Partisan Fairness in District-Based Democracies

    We clarify the theoretical foundations of partisan fairness standards for district-based democratic electoral systems, including essential assumptions and definitions not previously recognized, formalized, or in some cases even discussed. We also offer extensive empirical evidence for assumptions with observable implications. We cover partisan symmetry, the most commonly accepted fairness standard, and other perspectives. Throughout, we follow a fundamental principle of statistical inference too often ignored in this literature: defining the quantity of interest separately so its measures can be proven wrong, evaluated, and improved. This enables us to prove which of the many newly proposed fairness measures are statistically appropriate and which are biased, limited, or not measures of the theoretical quantity they seek to estimate at all. Because real-world redistricting and gerrymandering involve complicated politics with numerous participants and conflicting goals, measures biased for partisan fairness sometimes still provide useful descriptions of other aspects of electoral systems.
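
The partisan-symmetry standard mentioned above can be made concrete with a small numeric sketch: a plan is symmetric at vote share v if a party winning v of the vote gets the same seat share as the other party would at v. This is an illustrative toy using uniform partisan swing (a common modeling assumption, not necessarily the paper's); the district vote shares are invented.

```python
import numpy as np

# Hypothetical district-level two-party vote shares for party A.
district_votes = np.array([0.35, 0.42, 0.48, 0.53, 0.61, 0.58, 0.44, 0.57])

def seat_share(votes, swing):
    """Seat share won by party A after a uniform partisan swing."""
    return np.mean(votes + swing > 0.5)

# Partisan symmetry requires S(v) = 1 - S(1 - v): at statewide vote
# share v, party A should win the seat share party B would win at v.
def symmetry_deviation(votes, v):
    swing_a = v - votes.mean()        # swing giving party A vote share v
    swing_b = (1 - v) - votes.mean()  # swing giving party B vote share v
    return seat_share(votes, swing_a) - (1 - seat_share(votes, swing_b))

# Deviation at v = 0.5 is one common "partisan bias" summary.
print(symmetry_deviation(district_votes, 0.5))
```

Evaluating the deviation across a range of v traces out how far the seats-votes curve departs from symmetry, which is the kind of explicitly defined quantity of interest the abstract argues for.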

    Responsible Behavior for Constellations and Clusters

    Many large constellations are being considered for deployment over the next ten years into low Earth orbit (LEO). This paper seeks to quantify the risks that these constellations pose to the debris environment, the risks that the debris environment poses to these constellations, and the risks that these constellations pose to themselves. The three representative constellations examined in detail in this paper are operated (or planned to be operated) by Spire Global, Iridium, and OneWeb. This paper provides a balanced risk analysis including collision risk, operational risk, and non-adherence risk. For perspective, the risk posed by these economically useful constellations is compared to the risk associated with existing abandoned hardware deposited in clusters.

    Modeling and Analysing Respondent Driven Sampling as a Counting Process

    Respondent-driven sampling (RDS) is an approach to sampling design and analysis which utilizes the networks of social relationships that connect members of the target population, using chain-referral methods to facilitate sampling. RDS typically leads to biased sampling, favoring participants with many acquaintances. Naive estimates, such as the sample average, which are uncorrected for the sampling bias, will themselves be biased. To compensate for this bias, current methodology suggests inverse-degree weighting, where the "degree" is the number of acquaintances. This stems from the fundamental RDS assumption that the probability of sampling an individual is proportional to their degree. Since this assumption is tenuous at best, we propose to harness the additional information encapsulated in the time of recruitment within a model-based inference framework for RDS. This information is typically collected by researchers, but ignored. We adapt methods developed for inference in epidemic processes to estimate the population size, degree counts, and degree frequencies. While providing valuable information in themselves, these quantities ultimately serve to debias other estimators, such as a disease's prevalence. A fundamental advantage of our approach is that, being model-based, it makes all assumptions of the data-generating process explicit. This enables verification of the assumptions, maximum likelihood estimation, extension with covariates, and model selection. We develop asymptotic theory, proving consistency and asymptotic normality properties. We further compare these estimators to standard inverse-degree weighting through simulations and using real-world data; in both cases we find our estimators outperform current methods. The likelihood problem in the model we present is convex, and thus efficiently solvable. We implement these estimators in an R package, chords, available on CRAN.
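
The inverse-degree weighting that the abstract identifies as the current standard can be sketched in a few lines (a toy illustration with invented data, not the paper's proposed counting-process estimator, which uses recruitment times instead):

```python
import numpy as np

# Hypothetical RDS sample: each participant has a reported degree
# (number of acquaintances) and a binary outcome of interest.
degrees = np.array([2, 5, 1, 8, 3, 10, 4, 6])
outcomes = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # e.g. disease indicator

# Naive sample mean: biased toward high-degree participants, since they
# are more likely to be recruited.
naive = outcomes.mean()

# Inverse-degree weighting: a Hajek-style estimator that reweights by
# 1/degree, relying on the assumption that sampling probability is
# proportional to degree.
weights = 1.0 / degrees
weighted = np.sum(weights * outcomes) / np.sum(weights)

print(naive, weighted)
```

In this toy sample the positives happen to have lower degrees, so the weighted estimate exceeds the naive mean; the paper's point is that the proportional-to-degree assumption underlying these weights is itself tenuous, motivating the model-based alternative.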

    What's in a pattern? Examining the Type of Signal Multivariate Analysis Uncovers At the Group Level

    Multivoxel pattern analysis (MVPA) has gained enormous popularity in the neuroimaging community over the past few years. At the group level, most MVPA studies adopt an "information based" approach in which the sign of the effect of individual subjects is discarded and a non-directional summary statistic is carried over to the second level. This is in contrast to a directional "activation based" approach typical in univariate group level analysis, in which both signal magnitude and sign are taken into account. The transition from examining effects in one voxel at a time vs. several voxels (univariate vs. multivariate) has thus tacitly entailed a transition from directional to non-directional signal definition at the group level. While a directional group-level MVPA approach implies that individuals have similar multivariate spatial patterns of activity, in a non-directional approach each individual may have a distinct spatial pattern. Using an experimental dataset, we show that directional and non-directional group-level MVPA approaches uncover distinct brain regions with only partial overlap. We propose a method to quantify the degree of spatial similarity in activation patterns over subjects. Applied to an auditory task, we find higher values in auditory regions compared to control regions.
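
The directional vs. non-directional distinction can be demonstrated with simulated subject patterns (a toy sketch with invented data, not the paper's experimental analysis): when subjects carry reliable signal but with subject-specific signs, a signed group average washes out while an unsigned summary survives.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-subject effect patterns (subjects x voxels): every
# subject has a reliable effect of size 0.5, but its sign differs
# across subjects (a crude stand-in for distinct spatial patterns).
n_subjects, n_voxels = 20, 50
signs = rng.choice([-1.0, 1.0], size=(n_subjects, 1))
patterns = signs * 0.5 + rng.normal(0.0, 1.0, (n_subjects, n_voxels))

# Directional ("activation based") statistic: signed mean per voxel.
# Subject-level sign flips largely cancel at the group level.
directional = patterns.mean(axis=0)

# Non-directional ("information based") statistic: discard each
# subject's sign (here via the absolute effect) before aggregating.
non_directional = np.abs(patterns).mean(axis=0)

print(np.abs(directional).mean(), non_directional.mean())
```

This is why the two group-level approaches can highlight different regions: a region with consistent signal magnitude but inconsistent sign across subjects is visible only to the non-directional statistic.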

    Estimation of hourly near surface air temperature across Israel using an ensemble model

    Mapping of near-surface air temperature (Ta) at high spatio-temporal resolution is essential for unbiased assessment of human health exposure to temperature extremes, not least given the observed trend of urbanization and global climate change. Data constraints have led previous studies to focus solely on daily Ta metrics, rather than hourly ones, making them insufficient for intra-day assessment of health exposure. In this study, we present a three-stage machine learning-based ensemble model to estimate hourly Ta at a high spatial resolution of 1 × 1 km², incorporating remotely sensed surface skin temperature (Ts) from geostationary satellites, reanalysis synoptic variables, and observations from weather stations, as well as auxiliary geospatial variables, which account for spatio-temporal variability of Ta. The Stage 1 model gap-fills hourly Ts at 4 × 4 km² from the Spinning Enhanced Visible and InfraRed Imager (SEVIRI), which are subsequently fed into the Stage 2 model to estimate hourly Ta at the same spatio-temporal resolution. The Stage 3 model downscales the residuals between estimated and measured Ta to a grid of 1 × 1 km², taking into account additionally the monthly diurnal pattern of Ts derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) data. In each stage, the ensemble model synergizes estimates from the constituent base learners—random forest (RF) and extreme gradient boosting (XGBoost)—by applying a geographically weighted generalized additive model (GAM), which allows the weights of results from individual models to vary over space and time. Demonstrated for Israel for the period 2004–2017, the proposed ensemble model outperformed each of the two base learners. It also attained excellent five-fold cross-validated performance, with overall root mean square error (RMSE) of 0.8 and 0.9 °C, mean absolute error (MAE) of 0.6 and 0.7 °C, and R2 of 0.95 and 0.98 in Stage 1 and Stage 2, respectively. The Stage 3 model for downscaling Ta residuals to 1 km MODIS grids achieved overall RMSE of 0.3 °C, MAE of 0.5 °C, and R2 of 0.63. The generated hourly 1 × 1 km² Ta thus serves as a foundation for monitoring and assessing human health exposure to temperature extremes at a larger geographical scale, helping to further minimize exposure misclassification in epidemiological studies.
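
The core ensemble step, combining base learners with error-dependent weights, can be sketched in miniature (a toy with simulated predictions, not the paper's pipeline: the paper weights RF and XGBoost via a geographically weighted GAM so weights vary over space and time, whereas here the weights are global scalars chosen by validation error):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in: two base learners' predictions of Ta, one
# noisier than the other (playing the roles of RF and XGBoost).
ta_true = rng.normal(20.0, 5.0, 1000)
pred_a = ta_true + rng.normal(0.0, 1.0, 1000)  # base learner 1
pred_b = ta_true + rng.normal(0.0, 0.5, 1000)  # base learner 2

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Choose inverse-MSE blending weights on a validation split, then
# evaluate the blended prediction on held-out data.
val, hold = slice(0, 500), slice(500, None)
w_a = 1.0 / rmse(pred_a[val], ta_true[val]) ** 2
w_b = 1.0 / rmse(pred_b[val], ta_true[val]) ** 2
blend = (w_a * pred_a + w_b * pred_b) / (w_a + w_b)

print(rmse(blend[hold], ta_true[hold]))
```

Because the weights favor the more accurate learner, the blend's held-out error falls below that of the weaker base learner, mirroring (in a crude global form) why the spatially varying weighting in the paper can outperform either RF or XGBoost alone.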